The focus of our dashboard is to explore the interactions among multiple factors collected from NCAA basketball teams this season. We gathered our data from two online sources: our first data set was web scraped from kenpom.com, and our second was downloaded from sports-reference.com. We merged the two data sets on the name of the college to increase our number of variables, giving us more avenues to explore in our analysis. Our final data set has 41 variables and 362 observations.
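The merge step described above can be sketched in Python with pandas (the report's actual workflow is in R; the frames and column names below are illustrative stand-ins, not the real scraped schemas):

```python
import pandas as pd

# Hypothetical mini-versions of the two sources; the column names are
# illustrative, not the actual scraped schemas.
kenpom = pd.DataFrame({
    "Team": ["Auburn", "Duke", "Houston"],
    "EM": [27.99, 26.47, 30.12],
})
sports_ref = pd.DataFrame({
    "Team": ["Auburn", "Duke", "Houston"],
    "Wins": [27, 27, 32],
})

# Inner join on the school name, mirroring the report's merge step;
# a team spelled differently on the two sites would be dropped here.
merged_df = kenpom.merge(sports_ref, on="Team", how="inner")
print(merged_df.shape)  # (3, 3)
```

An inner join keeps only schools present in both sources, which is one reason the merged data can end up with fewer rows than either original table.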
| Variable | Variable Description |
|---|---|
| Team | Team Name |
| Rk | National Rank |
| Conf | Conference |
| Wins | Number of Games Won |
| Losses | Number of Games Lost |
| EM | Efficiency Margin |
| OE | Offensive Efficiency |
| DE | Defensive Efficiency |
| Tempo | Tempo (number of possessions per game) |
| Luck | Luck Rating |
| SOS | Strength of Schedule |
| OppO | Opposition Offensive Efficiency |
| OppD | Opposition Defensive Efficiency |
| NCSOS | Non-Conference Strength of Schedule |
| Win_Loss_Percentage | Win-Loss Percentage |
| Conference Wins | Number of Conference Games Won |
| Conference Losses | Number of Conference Games Lost |
| Home_W | Home Games Won |
| Home_L | Home Games Lost |
| Away_W | Away Games Won |
| Away_L | Away Games Lost |
| Points_For | Total Points Scored |
| Points_Against | Total Points Given Up |
| MP | Minutes Played |
| FG | Field Goals Made |
| FGA | Field Goals Attempted |
| FG_Perecntage | Field Goal Percentage |
| 3P | 3-Pointers Made |
| 3PA | 3-Pointers Attempted |
| 3P_Percentage | 3-Pointer Percentage |
| FT | Free-throws made |
| FTA | Free-throws Attempted |
| FT_Percentage | Free-throw Percentage |
| ORB | Offensive Rebounds |
| TRB | Total Rebounds |
| AST | Assists |
| STL | Steals |
| BLK | Blocks |
| TOV | Turnovers |
| PF | Personal Fouls |
| NCAA_Tourney | Made the NCAA Tournament |
What factors contribute to the number of wins a team achieves in college basketball, and how accurately can a multiple linear regression (MLR) model predict the win count based on these factors?
Rk Team Conf Wins Losses EM OE DE Tempo Luck SOS
4 4 Auburn SEC 27 8 27.99 120.4 92.4 70.0 -0.080 9.49
5 5 Tennessee SEC 27 9 26.61 116.8 90.2 69.3 -0.026 13.35
7 7 Duke ACC 27 9 26.47 121.6 95.2 66.4 -0.064 10.07
9 9 North Carolina ACC 29 8 26.19 119.7 93.5 70.6 -0.038 12.17
14 14 Alabama SEC 25 12 22.96 126.0 103.0 72.6 -0.001 14.71
19 19 Clemson ACC 24 12 19.44 117.7 98.3 66.4 -0.018 12.09
OppO OppD NCSOS
4 111.9 102.4 1.47
5 114.6 101.2 8.97
7 111.1 101.1 -0.04
9 112.6 100.5 6.99
14 115.1 100.4 9.46
19 113.5 101.4 4.91
Call:
lm(formula = Wins ~ ., data = predictive_data)
Residuals:
Min 1Q Median 3Q Max
-2.2478 -0.5713 -0.1385 0.6979 2.0755
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.591099 49.696734 0.394 0.69833
Rk -0.026090 0.022720 -1.148 0.26674
Losses 0.184998 0.337237 0.549 0.59043
EM 0.651338 10.265037 0.063 0.95015
OE -0.004057 10.280409 0.000 0.99969
DE -0.068121 10.285411 -0.007 0.99479
Tempo 0.074010 0.125028 0.592 0.56167
Luck 32.999028 10.587589 3.117 0.00627 **
SOS 1.950460 7.841266 0.249 0.80654
OppO -1.915704 7.705578 -0.249 0.80664
OppD 1.845868 7.827056 0.236 0.81638
NCSOS -0.232496 0.127219 -1.828 0.08523 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.267 on 17 degrees of freedom
Multiple R-squared: 0.9708, Adjusted R-squared: 0.952
F-statistic: 51.47 on 11 and 17 DF, p-value: 7.883e-11
Rk Losses EM OE DE Tempo
2.331805e+01 4.140231e+01 1.261413e+05 5.916812e+04 3.688691e+04 2.209332e+00
Luck SOS OppO OppD NCSOS
8.479177e+00 2.539884e+03 1.434161e+03 6.217872e+02 4.570716e+00
Call:
lm(formula = Wins ~ OE + DE + Tempo + Luck + OppO + OppD + NCSOS,
data = predictive_data)
Residuals:
Min 1Q Median 3Q Max
-2.23624 -0.83358 0.04624 0.91181 1.89847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.96367 45.13411 0.597 0.5566
OE 0.71375 0.05234 13.637 6.66e-12 ***
DE -0.76229 0.06221 -12.254 4.94e-11 ***
Tempo 0.05677 0.11074 0.513 0.6136
Luck 29.34739 3.70023 7.931 9.45e-08 ***
OppO 0.15818 0.29564 0.535 0.5982
OppD -0.32472 0.37518 -0.865 0.3966
NCSOS -0.25582 0.10549 -2.425 0.0244 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.194 on 21 degrees of freedom
Multiple R-squared: 0.968, Adjusted R-squared: 0.9574
F-statistic: 90.8 on 7 and 21 DF, p-value: 2.928e-14
OE DE Tempo Luck OppO OppD NCSOS
1.726822 1.519118 1.951416 1.166057 2.376931 1.608535 3.538177
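The numbers above are variance inflation factors, \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors. A Python/NumPy sketch of that computation on made-up data (the report computed these in R on the real predictors):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n x p, no
    intercept column): 1 / (1 - R^2_j), where R^2_j comes from
    regressing column j on the other columns plus an intercept."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Two nearly collinear predictors and one independent one: the first
# two should show inflated VIFs, the third should stay near 1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + 0.05 * rng.normal(size=200),
                     rng.normal(size=200)])
print(vif(X))
```

This is why EM, OE, and DE blow up together in the full model: EM is (by definition) the difference of the other two, so each is almost perfectly predictable from the rest.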
From the Added-Variable Plots we can see that there is a linear
relationship between almost all of the predictors and our response
variable.
studentized Breusch-Pagan test
data: reduced_mdl
BP = 10.267, df = 7, p-value = 0.1739
Shapiro-Wilk normality test
data: student_r
W = 0.97412, p-value = 0.6756
Rk Wins Losses EM OE DE Tempo Luck SOS OppO OppD NCSOS
45 45 26 15 15.9 114.3 98.4 68 0.014 11.33 112.2 100.8 -4.99
Rk Team Conf Wins Losses EM OE DE Tempo Luck SOS OppO OppD
45 45 N.C. State ACC 26 15 15.9 114.3 98.4 68 0.014 11.33 112.2 100.8
NCSOS
45 -4.99
It is interesting to see that N.C. State is an outlier. This is a team that went on a historic run to end the season, which could be causing them to show up as an influential point. Since they are not affecting our assumptions and are not an incorrect data point, we will not remove them.
OE DE Tempo Luck OppO OppD NCSOS R2 AdjR2 Cp BIC
1 ( 1 ) "*" " " " " " " " " " " " " "0.576" "0.56" "253.402" "-18.148"
2 ( 1 ) "*" "*" " " " " " " " " " " "0.859" "0.848" "69.62" "-46.697"
3 ( 1 ) "*" "*" " " "*" " " " " " " "0.954" "0.949" "9.126" "-75.9"
4 ( 1 ) "*" "*" " " "*" " " " " "*" "0.966" "0.961" "3.05" "-81.583"
5 ( 1 ) "*" "*" " " "*" " " "*" "*" "0.967" "0.96" "4.623" "-78.783"
6 ( 1 ) "*" "*" " " "*" "*" "*" "*" "0.968" "0.959" "6.263" "-75.902"
7 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "0.968" "0.957" "8" "-72.896"
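The table above is a best-subsets search: for each model size it marks the included predictors and reports R², adjusted R², Mallows' Cp, and BIC (here the four-variable model minimizes both Cp and BIC). A Python/NumPy sketch of the same idea, exhaustive OLS fits scored by BIC, on toy data where only two of four predictors matter:

```python
import numpy as np
from itertools import combinations

def ols_rss(X, y):
    """Residual sum of squares of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def best_subset_bic(X, y, names):
    """Exhaustive best-subset search scored by
    BIC = n*log(RSS/n) + k*log(n), one criterion the table minimizes."""
    n = len(y)
    best_bic, best_vars = np.inf, None
    for size in range(1, len(names) + 1):
        for idx in combinations(range(len(names)), size):
            Xs = np.column_stack([np.ones(n), X[:, idx]])
            bic = n * np.log(ols_rss(Xs, y) / n) + Xs.shape[1] * np.log(n)
            if bic < best_bic:
                best_bic, best_vars = bic, tuple(names[i] for i in idx)
    return best_bic, best_vars

# Toy data: y truly depends on only two of the four predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=100)
bic, chosen = best_subset_bic(X, y, ["OE", "DE", "Tempo", "Luck"])
print(chosen)
```

Exhaustive search is feasible here because with 7 candidate predictors there are only \(2^7 - 1 = 127\) subsets; R's leaps package does the equivalent search more efficiently.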
Call:
lm(formula = Wins ~ OE + DE + Luck + NCSOS, data = predictive_data)
Residuals:
Min 1Q Median 3Q Max
-2.2479 -0.7750 -0.1218 0.5757 2.6843
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.20368 8.07252 1.512 0.14365
OE 0.72832 0.04445 16.384 1.57e-14 ***
DE -0.74609 0.05194 -14.364 2.78e-13 ***
Luck 28.63050 3.42921 8.349 1.47e-08 ***
NCSOS -0.17619 0.05943 -2.965 0.00674 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.144 on 24 degrees of freedom
Multiple R-squared: 0.9664, Adjusted R-squared: 0.9608
F-statistic: 172.7 on 4 and 24 DF, p-value: < 2.2e-16
studentized Breusch-Pagan test
data: best_mdl
BP = 6.9124, df = 4, p-value = 0.1406
Shapiro-Wilk normality test
data: student_r
W = 0.9607, p-value = 0.342
The model that we found to be the best at predicting the number of wins a team will get is: \[Wins = 12.20 + 0.73x_{OE} - 0.75x_{DE} + 28.63x_{Luck} - 0.18x_{NCSOS}\]
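As a quick plug-in check, the fitted equation (using the full-precision coefficients from the lm() summary above) can be applied to a row from the data preview, Auburn, whose actual total was 27 wins:

```python
def predict_wins(oe, de, luck, ncsos):
    """Plug-in prediction from the final model's lm() coefficients."""
    return (12.20368 + 0.72832 * oe - 0.74609 * de
            + 28.63050 * luck - 0.17619 * ncsos)

# Auburn's preview row: OE = 120.4, DE = 92.4, Luck = -0.080, NCSOS = 1.47
print(round(predict_wins(120.4, 92.4, -0.080, 1.47), 1))  # 28.4
```

A prediction of about 28.4 against 27 actual wins is consistent with the residual standard error of 1.144 reported for this model.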
Question: What is the relation between Wins, SOS, and OE, and how
accurately can we predict Wins using SOS and OE by constructing a
Ridge model?
Ridge Regression
Ridge regression is a regularization technique (a method used in statistics to reduce error caused by overfitting) for linear regression models. It is used to curb overfitting to the training data, and it is also known as L2 regularization. A key problem it addresses is multicollinearity. In this regularization technique we add a small amount of bias into the model in order to decrease the model's variance.
The residual sum of squares formula for linear regression is given by
\[RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
Where:
- \(n\) is the number of data points in the dataset.
- \(y_i\) is the observed value of the dependent variable for data point \(i\).
- \(\hat{y}_i\) is the predicted value of the dependent variable for data point \(i\) based on the regression model.
Adding the regularization term used by ridge regression, we get
\[RSS_{ridge} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2\]
Where:
- \(\lambda\) is the regularization parameter (also known as the ridge parameter or penalty parameter) that controls the strength of the regularization.
- \(p\) is the number of predictor variables (features) in the regression model.
- \(\beta_j\) represents the coefficients (weights) associated with each predictor variable.
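The penalized criterion above has a closed-form minimizer, \(\hat{\beta} = (X^{T}X + \lambda I)^{-1}X^{T}y\). A minimal Python/NumPy sketch on made-up data (the report's actual fit uses R's glmnet, which additionally standardizes the predictors and leaves the intercept unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution beta = (X'X + lam*I)^{-1} X'y for the
    penalized RSS above; no intercept term here, for simplicity."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam = 0 recovers ordinary least squares; a large lam shrinks the
# coefficients toward zero, trading bias for lower variance.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)
b_ols = ridge_fit(X, y, 0.0)
b_big = ridge_fit(X, y, 1000.0)
print(b_ols, b_big)
```

The shrinkage is visible directly: the coefficients fitted with a large \(\lambda\) are pulled well inside the OLS values.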
[1] "y = 0.752500495321759 * OE + -0.265035011621413 * SOS + -62.3402694619341"
3D scatter plot of the training data, where the x-axis represents offensive efficiency (OE), the y-axis represents strength of schedule (SOS), and the z-axis represents the number of wins (Wins). Each data point is shown as a marker in the plot.
Length Class Mode
lambda 100 -none- numeric
cvm 100 -none- numeric
cvsd 100 -none- numeric
cvup 100 -none- numeric
cvlo 100 -none- numeric
nzero 100 -none- numeric
call 4 -none- call
name 1 -none- character
glmnet.fit 12 elnet list
lambda.min 1 -none- numeric
lambda.1se 1 -none- numeric
index 2 -none- numeric
To obtain the equation of the ridge regression model, we first fitted
the model using cross-validated ridge regression with the
cv.glmnet function in R. This function selects an optimal
lambda value through cross-validation.
After fitting the ridge regression model, we extracted the coefficients corresponding to the optimal lambda value. The coefficients represent the weights assigned to each predictor variable in the model.
The equation of the ridge regression model can be written as
follows:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2
+ \ldots + \beta_n x_n \]
Where:
- \(y\) is the dependent variable
(e.g., Wins in our case).
- \(\beta_0\) is the intercept
term.
- \(\beta_1, \beta_2, \ldots, \beta_n\)
are the coefficients corresponding to predictor variables \(x_1, x_2, \ldots, x_n\) respectively.
For our specific ridge regression model, the coefficients and
variables are substituted into the equation to form the final equation,
which can be written in the form:
\[ \text{Wins} = -62.34 + 0.75 \times \text{OE} - 0.27 \times \text{SOS} \]
Ridge Regression Model:
MSE: 13.07958
R-squared: 0.6100785
RMSE: 3.616571
These values suggest that the ridge regression model is moderately effective at predicting basketball team wins from offensive efficiency and strength of schedule. With an R-squared of about 0.61 and an RMSE of about 3.6 wins, however, there is still clear room for improvement.
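For reference, the three reported quantities are computed as follows; a stdlib-Python sketch with made-up values (the report computed them in R on the ridge model's test-set predictions):

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and R^2, the three quantities reported above."""
    n = len(y_true)
    mse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    r2 = 1 - mse * n / ss_tot  # 1 - SS_res / SS_tot
    return mse, math.sqrt(mse), r2

# Hypothetical actual vs. predicted win totals for five teams.
y_true = [20, 25, 18, 30, 22]
y_pred = [21, 24, 20, 28, 22]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(mse, rmse, r2)
```

RMSE is simply the square root of MSE, which is why the report's 13.08 MSE and 3.62 RMSE carry the same information on different scales.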
What is the nature of the relationship between offensive efficiency (OE) and three-pointer percentage (3P%) in basketball, and how does the application of a loess fit, compared to traditional linear regression, enhance our understanding of this relationship? As the three-pointer percentage increases, offensive efficiency is expected to increase as well, since made three-pointers contribute more points per possession than two-point field goals. However, the relationship may not be strictly linear, and a loess fit may capture non-linear patterns more accurately than a linear regression model.
Call:
loess(formula = merged_df$OE ~ merged_df$`3P_Percentage`, data = merged_df,
span = 0.4843714, degree = 1)
Number of Observations: 362
Equivalent Number of Parameters: 4.95
Residual Standard Error: 5.783
Trace of smoother matrix: 5.82 (exact)
Control settings:
span : 0.4843714
degree : 1
family : gaussian
surface : interpolate cell = 0.2
normalize: TRUE
parametric: FALSE
drop.square: FALSE
Call:
loess(formula = merged_df$OE ~ merged_df$`3P_Percentage`, data = merged_df,
span = 0.6612838, degree = 2)
Number of Observations: 362
Equivalent Number of Parameters: 5.98
Residual Standard Error: 5.78
Trace of smoother matrix: 6.56 (exact)
Control settings:
span : 0.6612838
degree : 2
family : gaussian
surface : interpolate cell = 0.2
normalize: TRUE
parametric: FALSE
drop.square: FALSE
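Loess has no single global equation; each fitted value comes from a weighted regression on nearby points. A simplified pure-NumPy sketch of the degree = 1 case (tricube weights, no robustness iterations, toy data), just to make the mechanics of the summaries above concrete:

```python
import numpy as np

def loess_point(x0, x, y, span=0.5):
    """Local linear fit at x0 using tricube weights over the nearest
    span*n points: the degree = 1 smoother summarized above, minus
    loess's iterative robustness steps."""
    n = len(x)
    k = max(2, int(round(span * n)))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]
    w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3   # tricube kernel
    A = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y[idx])
    return beta[0] + beta[1] * x0

# Smooth a noisy sine curve: the local fit should track the signal.
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 6, 200))
y = np.sin(x) + rng.normal(scale=0.2, size=200)
print(loess_point(np.pi / 2, x, y, span=0.3))  # should land near sin(pi/2) = 1
```

The span plays the same role as in the R output above: a larger span averages over more points, giving a smoother but less flexible curve.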
Question: How accurately can a K-nearest neighbors (KNN) classifier
predict the success level of basketball teams based on their defensive
efficiency (DE), strength of schedule (SOS), and tempo?
The levels are divided based on Win-Loss Percentage as follows:
We are performing \(\textbf{k-Nearest Neighbors (kNN)}\) classification on a dataset with predictors \(\textbf{SOS}\) (Strength of Schedule), \(\textbf{DE}\) (Defensive Efficiency), and \(\textbf{Tempo}\), categorizing teams based on their win-loss percentages into four categories: \(\textbf{Successful}\), \(\textbf{Above Average}\), \(\textbf{Average}\), and \(\textbf{Below Average}\). We split the data into training and test sets, train a kNN classifier with \(k=7\) neighbors, and evaluate its accuracy, providing insight into team categorization based on performance metrics.
| SOS | DE | Tempo | team_cat |
|---|---|---|---|
| 12.42 | 91.1 | 64.6 | Successful |
| 11.57 | 87.7 | 63.5 | Successful |
| 14.65 | 94.6 | 67.0 | Successful |
| 9.49 | 92.4 | 70.0 | Successful |
| 13.35 | 90.2 | 69.3 | Above Average |
| 11.12 | 93.7 | 72.2 | Above Average |
Accuracy of KNN classifier: 0.6849315
The above graph is obtained after performing Principal Component Analysis (PCA) on the \(\textbf{SOS}\), \(\textbf{DE}\), and \(\textbf{Tempo}\) variables from the \(\textbf{train_data}\). It converts the PCA results into a data frame and creates an interactive scatter plot using plotly, where each data point represents a team. The plot displays the teams in a two-dimensional space based on the first two principal components (\(\textbf{PC1}\) and \(\textbf{PC2}\)), with color indicating the \(\textbf{team_cat}\) variable (team category) and team names shown as hover text.
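The PC1/PC2 coordinates in that plot come from projecting the standardized (SOS, DE, Tempo) matrix onto its leading principal directions. A NumPy sketch via SVD on toy data (R's prcomp does the equivalent; whether the report scaled the variables before the PCA is an assumption here):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X onto the first principal components, via
    SVD of the centered and scaled data; these are the plotted
    PC1/PC2 coordinates."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

# Toy stand-in for the (SOS, DE, Tempo) matrix of the training data.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
scores = pca_scores(X)
print(scores.shape)  # (30, 2)
```

By construction PC1 captures at least as much variance as PC2, which is why a two-dimensional PCA plot is a reasonable summary of three correlated features.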
The above plot is a comprehensive evaluation of KNN models with the number of neighbors (K) ranging from 1 to 50. We calculate the accuracy of each KNN model by comparing its predictions on the test dataset against the actual labels, identify the K value that achieves the highest accuracy, and highlight this optimal point in the plot in red.
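The classifier and the K sweep described above can be sketched in a few lines of stdlib Python (the tiny made-up points below stand in for the scaled training data; the report's actual fit was done in R):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k nearest training points (Euclidean
    distance); features should be scaled before use."""
    dists = sorted((math.dist(point, x), y)
                   for x, y in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Tiny illustrative (DE, Tempo) points: low DE (good defense) teams
# are labeled "Successful" here.
train_X = [(91.1, 64.6), (87.7, 63.5), (103.2, 70.1), (105.8, 68.4)]
train_y = ["Successful", "Successful", "Below Average", "Below Average"]

# Sweep k and keep the most accurate, mirroring the k = 1..50 search.
test_X = [(90.0, 64.0), (104.0, 69.0)]
test_y = ["Successful", "Below Average"]
best_k = max(
    range(1, 4),
    key=lambda k: sum(knn_predict(train_X, train_y, p, k) == t
                      for p, t in zip(test_X, test_y)),
)
print(best_k)
```

Tuning K on a held-out set this way trades off noise sensitivity (small K) against over-smoothing (large K).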
Cell Contents
|-------------------------|
| N |
| N / Col Total |
|-------------------------|
Total Observations in Table: 73
| results$Actual
results$Predicted | Above Average | Average | Below Average | Successful | Row Total |
------------------|---------------|---------------|---------------|---------------|---------------|
Above Average | 32 | 6 | 1 | 2 | 41 |
| 0.821 | 0.222 | 0.200 | 1.000 | |
------------------|---------------|---------------|---------------|---------------|---------------|
Average | 7 | 21 | 4 | 0 | 32 |
| 0.179 | 0.778 | 0.800 | 0.000 | |
------------------|---------------|---------------|---------------|---------------|---------------|
Column Total | 39 | 27 | 5 | 2 | 73 |
| 0.534 | 0.370 | 0.068 | 0.027 | |
------------------|---------------|---------------|---------------|---------------|---------------|
$t
y
x Above Average Average Below Average Successful
Above Average 32 6 1 2
Average 7 21 4 0
$prop.row
y
x Above Average Average Below Average Successful
Above Average 0.78048780 0.14634146 0.02439024 0.04878049
Average 0.21875000 0.65625000 0.12500000 0.00000000
$prop.col
y
x Above Average Average Below Average Successful
Above Average 0.8205128 0.2222222 0.2000000 1.0000000
Average 0.1794872 0.7777778 0.8000000 0.0000000
$prop.tbl
y
x Above Average Average Below Average Successful
Above Average 0.43835616 0.08219178 0.01369863 0.02739726
Average 0.09589041 0.28767123 0.05479452 0.00000000
Confusion Matrix and Statistics
Reference
Prediction Successful Above Average Average Below Average
Successful 0 0 0 0
Above Average 2 32 6 1
Average 0 7 21 4
Below Average 0 0 0 0
Overall Statistics
Accuracy : 0.726
95% CI : (0.6091, 0.8239)
No Information Rate : 0.5342
P-Value [Acc > NIR] : 0.0006216
Kappa : 0.4906
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Successful Class: Above Average Class: Average
Sensitivity 0.0000 0.8205 0.7778
Specificity 1.0000 0.7353 0.7609
Pos Pred Value NaN 0.7805 0.6562
Neg Pred Value 0.9726 0.7812 0.8537
Prevalence 0.0274 0.5342 0.3699
Detection Rate 0.0000 0.4384 0.2877
Detection Prevalence 0.0000 0.5616 0.4384
Balanced Accuracy 0.5000 0.7779 0.7693
Class: Below Average
Sensitivity 0.00000
Specificity 1.00000
Pos Pred Value NaN
Neg Pred Value 0.93151
Prevalence 0.06849
Detection Rate 0.00000
Detection Prevalence 0.00000
Balanced Accuracy 0.50000
The cross table compares the actual classes with the predicted
classes from the classification model. Reading down each column
(actual class):
- "Above Average": correctly predicted 32 of 39 instances
(82.05% accuracy).
- "Average": correctly predicted 21 of 27 instances (77.78%
accuracy).
- "Below Average": correctly predicted 0 of 5 instances (0%
accuracy).
- "Successful": correctly predicted 0 of 2 instances (0%
accuracy).
The total number of instances considered in the table is 73.
How can Naive Bayes classification be utilized to categorize college basketball teams as good, average, or bad 3-point shooting teams based on their three-pointer percentage, considering that we observed a positive relationship between three-pointer percentage and Offensive Efficiency when using LOESS?
Accuracy: 0.6438356
Cell Contents
|-------------------------|
| N |
| N / Col Total |
|-------------------------|
Total Observations in Table: 73
| actual
predicted | poor | average | good | excellent | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
poor | 3 | 6 | 0 | 0 | 9 |
| 0.600 | 0.146 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
average | 2 | 27 | 9 | 0 | 38 |
| 0.400 | 0.659 | 0.346 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
good | 0 | 8 | 17 | 1 | 26 |
| 0.000 | 0.195 | 0.654 | 1.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 5 | 41 | 26 | 1 | 73 |
| 0.068 | 0.562 | 0.356 | 0.014 | |
-------------|-----------|-----------|-----------|-----------|-----------|
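With a single continuous feature (3P%), Gaussian Naive Bayes reduces to comparing per-class normal densities weighted by class priors. A stdlib-Python sketch with made-up 3P% values (the report's model was fit in R on the real data, and may have used different tier boundaries):

```python
import math
from collections import defaultdict

def gaussian_nb_fit(xs, labels):
    """Per-class mean, variance, and prior for one Gaussian feature."""
    groups = defaultdict(list)
    for x, c in zip(xs, labels):
        groups[c].append(x)
    model = {}
    for c, v in groups.items():
        mu = sum(v) / len(v)
        var = sum((x - mu) ** 2 for x in v) / len(v)
        model[c] = (mu, var, len(v) / len(xs))
    return model

def gaussian_nb_predict(model, x):
    """Pick the class with the highest log posterior (up to a constant)."""
    def log_post(mu, var, prior):
        return (math.log(prior) - 0.5 * math.log(2 * math.pi * var)
                - (x - mu) ** 2 / (2 * var))
    return max(model, key=lambda c: log_post(*model[c]))

# Hypothetical 3P% values for three shooting tiers.
xs = [0.30, 0.31, 0.34, 0.35, 0.38, 0.39]
labels = ["poor", "poor", "average", "average", "good", "good"]
model = gaussian_nb_fit(xs, labels)
print(gaussian_nb_predict(model, 0.345))  # average
```

Because the feature is one-dimensional here, the "naive" independence assumption is vacuous; it matters only when multiple features are combined.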
What is the relationship between a team’s number of wins (Wins) and
their likelihood of having an above-average 3-point percentage
(3P_Percentage > mean) versus a below-average 3-point percentage
(3P_Percentage <= mean)?
Variables Used
Let’s briefly look at the data
| Team | Rk | Wins | 3P_Percentage | Binary_3P |
|---|---|---|---|---|
| Connecticut | 1 | 37 | 0.358 | 1 |
| Houston | 2 | 32 | 0.348 | 1 |
| Purdue | 3 | 34 | 0.406 | 1 |
| Auburn | 4 | 27 | 0.352 | 1 |
| Tennessee | 5 | 27 | 0.344 | 1 |
| Arizona | 6 | 27 | 0.366 | 1 |
Call:
glm(formula = Binary_3P ~ Wins, family = binomial, data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.37232 0.49966 -6.749 1.49e-11 ***
Wins 0.20052 0.02816 7.120 1.08e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 400.36 on 288 degrees of freedom
Residual deviance: 328.48 on 287 degrees of freedom
AIC: 332.48
Number of Fisher Scoring iterations: 4
The logistic regression equation for the model is:
\[ \eta = -3.37 + 0.20 \times \text{Wins} \]
Where:
- \(\eta\) (eta) is the linear predictor.
- Wins is the predictor variable.
- The intercept is -3.37232.
- The coefficient for Wins is 0.20052.
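Converting the linear predictor \(\eta\) into a probability goes through the logistic function, \(P = 1/(1 + e^{-\eta})\). A small stdlib-Python sketch using the fitted coefficients above:

```python
import math

def p_above_avg_3p(wins):
    """Fitted probability of an above-average 3P% from the logit model:
    P = 1 / (1 + exp(-(-3.37232 + 0.20052 * Wins)))."""
    eta = -3.37232 + 0.20052 * wins
    return 1 / (1 + math.exp(-eta))

# The probability of an above-average 3P% rises steadily with wins.
print(round(p_above_avg_3p(10), 3), round(p_above_avg_3p(30), 3))
```

A positive Wins coefficient means each additional win multiplies the odds of an above-average 3P% by \(e^{0.20052} \approx 1.22\).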
Analysis of Deviance Table
Model 1: Binary_3P ~ 1
Model 2: Binary_3P ~ Wins
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 288 400.36
2 287 328.48 1 71.876 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Assessing the overall goodness-of-fit.
The Analysis of Deviance Table suggests that including the predictor "Wins" significantly improves the logistic regression model's fit for predicting the binary outcome "Binary_3P." The model with "Wins" as a predictor explains significantly more of the variability in the response variable than the null model.
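The deviance drop of 71.876 on 1 degree of freedom is compared against a chi-squared distribution. For one degree of freedom the upper tail has a closed form via the complementary error function, so the table's p-value can be checked in stdlib Python:

```python
import math

def chi2_1_pvalue(x):
    """Upper-tail probability P(chi2_1 > x) = erfc(sqrt(x / 2));
    this closed form is valid only for 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Observed deviance drop from the table: 400.36 - 328.48 = 71.876
# (the printed deviances are rounded).
print(chi2_1_pvalue(71.876))
```

The result is far below 2.2e-16, consistent with the "< 2.2e-16" R prints when a p-value underflows its default display threshold.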
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 20 14
1 13 26
Accuracy : 0.6301
95% CI : (0.5091, 0.7403)
No Information Rate : 0.5479
P-Value [Acc > NIR] : 0.09731
Kappa : 0.2554
Mcnemar's Test P-Value : 1.00000
Sensitivity : 0.6061
Specificity : 0.6500
Pos Pred Value : 0.5882
Neg Pred Value : 0.6667
Prevalence : 0.4521
Detection Rate : 0.2740
Detection Prevalence : 0.4658
Balanced Accuracy : 0.6280
'Positive' Class : 0
The confusion matrix and metrics like accuracy (63.01%) and Cohen's kappa (0.2554) reflect the binary classification model's performance. Sensitivity (60.61%) and specificity (65.00%) show its ability to detect positive and negative instances accurately. The positive predictive value (58.82%) and prevalence (45.21%) offer insight into prediction accuracy and dataset composition.
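The headline numbers in that caret output can be reproduced directly from the 2x2 counts; a stdlib-Python sketch:

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy and Cohen's kappa from a 2x2 confusion matrix, with the
    '0' class treated as positive as in the caret output above."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    # Expected agreement if predictions and references were independent
    # with the same marginal frequencies.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((fp + tn) / n) * ((fn + tn) / n)
    pe = p_yes + p_no
    kappa = (acc - pe) / (1 - pe)
    return acc, kappa

# Counts from the confusion matrix above (rows = predicted,
# columns = reference; positive class "0").
acc, kappa = binary_metrics(tp=20, fn=13, fp=14, tn=26)
print(round(acc, 4), round(kappa, 4))  # 0.6301 0.2554
```

Kappa discounts the agreement expected by chance, which is why it (0.2554) is much lower than the raw accuracy (0.6301) here.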